Challenges for week 5¶

Now that we've seen how to use scikit-learn, statsmodels and LIME to do modelling, it's time for you to apply this knowledge. This week has three challenges.

Each challenge has three components:

  1. Programming: Applying one of the programming or data analysis steps in Python you learned in the tutorial
  2. Interpretation: Explaining what you are doing and interpreting the results of the data analysis in Markdown
  3. Reflection: Connecting these concepts with the literature of the week in a short reflection (max 300 words)

Some important notes for the challenges:

  1. These challenges are a warm-up and help you get ready for class. Make sure to give all of them a try. If you get an error message, try to troubleshoot it (Google often helps). If all else fails, move on to the next challenge (but make sure to hand in what you have).
  2. While we of course like it when you get all the answers right, the important thing is to exercise and apply the knowledge. We will therefore still accept challenges that are not complete, as long as we see enough effort on each challenge. The rubric (see Canvas) reflects this.
  3. Delivering the challenge on time via the Canvas assignment is critical, as it also helps you prepare for the DA live session. Check Canvas for how to hand it in.

Facing issues?¶

We are constantly monitoring the issues on GitHub to help you out. Don't hesitate to log an issue there, explaining clearly what the problem is, showing the code you are using, and including any error message you may be receiving.

Important: We only monitor the repository on weekdays, from 9.30 to 17.00. Issues logged outside these hours will most likely be answered the next day. This means you should not wait for our response before submitting a challenge :-)

Challenge 1¶

Programming challenge¶

In DA5 and 6, we will work with a simulated dataset from an online store that created a set of campaigns. This dataset will be used to predict whether someone made a purchase or not (i.e., purchase) and, if so, how much they spent (i.e., order_euros). These predictions will later be used for targeting users. For example, the marketing manager wants to identify users with a high likelihood of making a purchase to target them with advertising campaigns that increase their awareness of the store, and to identify users who are likely to spend more than 100 euros in the store and give them discounts that other users would not receive.

These predictions will be made about the dependent variables (or targets): purchase and order_euros. Your independent variables (or features) will be relevant characteristics available in (or created from) the datasets below.

You will find three files in this folder (starting with da5-6):

  • da56_orders.csv.gz (or da56_orders.pkl.gz for a pickle version). This file contains a list of orders made in an online store. It contains the session_id of the session in which the order was made, whether there was a purchase, and how many euros were spent on that order.
  • da56_sessions.csv.gz (or da56_sessions.pkl.gz for a pickle version). This file contains a list of online sessions on the website. For each session, it contains the details of the browser that visited the site (user_agent), when the visit started, what the referral was, and whether the referral was part of a paid campaign. If a user had a profile when making the purchase, this file will also contain their user id.
  • da56_users.jsonl. This file contains a list of registered users on the website: their name, when they created their profile, their age, the referral that they followed when creating their profile, and whether they are part of an exclusive club of preferential clients.
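For reference, each of these formats can be read directly with pandas. The snippet below writes tiny stand-in files (invented values) to a temporary folder so it runs on its own; with the real data you would simply pass the filenames above, and the .pkl.gz versions would go through pd.read_pickle:

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Tiny stand-in files (invented values) so this sketch is self-contained;
# with the real data, pass the da56 filenames directly.
tmp = tempfile.mkdtemp()
csv_path = os.path.join(tmp, 'orders.csv.gz')
jsonl_path = os.path.join(tmp, 'users.jsonl')

with gzip.open(csv_path, 'wt') as f:
    f.write('session_id,order_euros,purchase\n1,150.0,1.0\n')
with open(jsonl_path, 'w') as f:
    f.write(json.dumps({'user_info': {'id': '1', 'age': 21}}) + '\n')

# pandas infers gzip compression from the .gz extension
orders = pd.read_csv(csv_path)
# JSON Lines means one JSON object per line, hence lines=True
users = pd.read_json(jsonl_path, lines=True)

print(orders.shape, users.shape)  # (1, 3) (1, 1)
```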

In this first challenge, you will load, inspect, clean, and visualize your own dataset. Specifically, you will:

  1. Load and inspect each file using the steps you learned in the course
  2. Select the files that you will use for the challenge (tip: you have to use at least two of the three files above)
  3. Transform the files into the appropriate format and merge them appropriately
  4. Review the merged dataset and remove personally identifiable data
  5. Perform any additional data cleaning steps you believe are necessary (including creating binary variables for the models)
  6. Provide descriptive statistics for all the relevant dependent and independent variables
  7. Create charts for each dependent and independent variable, and at least one chart with an interesting bivariate relationship between one independent variable and one dependent variable

Tip: If you decide to use the da56_users dataset, you may end up with several missing values (for users who entered the website but did not have a profile, or who made the purchase checking out as a guest). You can decide whether to fill these missing values and keep the cases, if appropriate, or simply to work with a smaller dataset (only registered users). Both ways are fine but, whatever you do, make sure to explain what you are doing and justify your choices.
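The two strategies in the tip can be sketched on a toy merged frame (column names follow the challenge files; the values are invented):

```python
import numpy as np
import pandas as pd

# Toy merged frame: sessions without a matching user profile get NaN
# (column names follow the challenge files; the values are made up).
merged = pd.DataFrame({
    'session_id': [1, 2, 3, 4],
    'user_id': ['u1', None, 'u2', None],
    'preferential_client': [1.0, np.nan, np.nan, np.nan],
})

# Option 1: keep all sessions and fill the gaps
# (here: treat "no profile" / "no flag" as "not a preferential client").
filled = merged.copy()
filled['preferential_client'] = filled['preferential_client'].fillna(0)

# Option 2: keep only the sessions of registered users.
registered_only = merged.dropna(subset=['user_id'])

print(len(filled), len(registered_only))  # 4 2
```

Either way, the choice changes the population your model describes, which is exactly why the justification matters.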

Interpretation¶

Write a technical and a business interpretation for the creation of the dataset.

  • The technical interpretation should explain, step-by-step, what you are doing with the Python commands that you are using. Before each line of code, add a line in Markdown that outlines what is being done below, and why. Note: For this challenge, this technical interpretation should be combined with the programming steps above.
  • Create a business summary about the dataset for a stakeholder with a discussion on the key metrics in this dataset (e.g., descriptives for the independent and dependent variables).

Use Markdown formatting to make the summary clear and visually appealing, and consistently refer to the output of the code to substantiate your claims. If needed, you can interleave the code with the summary (e.g., start the summary in Markdown, run the relevant code in the next cell, and interpret the output in Markdown in the cell after that).

Reflection¶

In sections 1 and 2 of their article, Tràmer et al. (2017) discuss how unfair treatment can arise from irresponsible data usage, and provide a few examples. You are now working with digital trace data from a (fictitious) online store, and will later create models that will be used to target campaigns or give discounts to consumers. Please briefly discuss how this may create a risk of unfair treatment (and indicate specifically what that unfair treatment would be, and why). Moving one step further, look at the variables in the original datasets (da56_orders, da56_sessions and da56_users). Select the one that you believe has the most potential for creating risks of unfair treatment, and explain why.

In [1]:
import pandas as pd
import seaborn as sns
%matplotlib inline

Orders¶

In [2]:
orders = pd.read_pickle('da56_orders.pkl.gz')
In [3]:
orders.columns
Out[3]:
Index(['session_id', 'order_euros', 'purchase'], dtype='object')
In [4]:
orders.dtypes
Out[4]:
session_id       int64
order_euros    float64
purchase       float64
dtype: object
In [5]:
orders.isna().sum()
Out[5]:
session_id     0
order_euros    0
purchase       0
dtype: int64
In [6]:
orders.head()
Out[6]:
session_id order_euros purchase
0 5555694754 150.0 1.0
11 5555694765 150.0 1.0
18 5555694772 150.0 1.0
19 5555694773 239.0 1.0
20 5555694774 150.0 1.0
In [22]:
orders[['order_euros', 'purchase']].describe()
Out[22]:
order_euros purchase
count 11339.000000 11339.0
mean 178.913661 1.0
std 54.820250 0.0
min 110.500000 1.0
25% 150.000000 1.0
50% 150.000000 1.0
75% 208.000000 1.0
max 487.000000 1.0

Sessions¶

In [7]:
sessions = pd.read_pickle('da56_sessions.pkl.gz')
In [8]:
sessions.columns
Out[8]:
Index(['session_id', 'session_timestamp', 'user_agent', 'referral',
       'paid_campaign', 'user_id'],
      dtype='object')
In [9]:
sessions.dtypes
Out[9]:
session_id                   object
session_timestamp    datetime64[ns]
user_agent                   object
referral                     object
paid_campaign               float64
user_id                      object
dtype: object
In [10]:
sessions.isna().sum()
Out[10]:
session_id               0
session_timestamp        0
user_agent               0
referral                 0
paid_campaign        21569
user_id              34877
dtype: int64
In [11]:
sessions.head()
Out[11]:
session_id session_timestamp user_agent referral paid_campaign user_id
0 5555694754 2021-09-20 06:59:09 Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2... google.com 1.0 NaN
1 5555694755 2021-09-23 19:07:17 Mozilla/5.0 (Windows; U; Windows NT 6.0) Apple... instagram.com 4.0 NaN
2 5555694756 2021-09-25 14:12:23 Mozilla/5.0 (Android 2.3.6; Mobile; rv:7.0) Ge... instagram.com NaN NaN
3 5555694757 2021-09-20 11:12:36 Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ... google.com NaN NaN
4 5555694758 2021-09-24 11:12:22 Opera/9.52.(X11; Linux i686; bho-IN) Presto/2.... google.com 2.0 NaN

Users¶

In [14]:
users = pd.read_json('da56_users.jsonl', lines=True)
In [15]:
users.columns
Out[15]:
Index(['user_info'], dtype='object')
In [16]:
users.head()
Out[16]:
user_info
0 {'id': '55885858', 'reg_name': 'Melissa Hanson...
1 {'id': '55885859', 'reg_name': 'Danielle Evans...
2 {'id': '55885860', 'reg_name': 'Erika Horton',...
3 {'id': '55885861', 'reg_name': 'Nicole Campbel...
4 {'id': '55885862', 'reg_name': 'Jessica Sanche...
In [17]:
users = pd.json_normalize(users['user_info'])
In [18]:
users.head()
Out[18]:
id reg_name age registration_date initial_referrer preferential_client
0 55885858 Melissa Hanson 21 2019-10-29 instagram.com NaN
1 55885859 Danielle Evans 20 2019-11-08 instagram.com NaN
2 55885860 Erika Horton 18 2021-01-27 google.com NaN
3 55885861 Nicole Campbell 46 2020-06-21 google.com NaN
4 55885862 Jessica Sanchez 64 2020-02-15 massey.com NaN
In [19]:
users.columns
Out[19]:
Index(['id', 'reg_name', 'age', 'registration_date', 'initial_referrer',
       'preferential_client'],
      dtype='object')
In [20]:
users.isna().sum()
Out[20]:
id                       0
reg_name                 0
age                      0
registration_date        0
initial_referrer         0
preferential_client    831
dtype: int64
In [21]:
users.dtypes
Out[21]:
id                      object
reg_name                object
age                      int64
registration_date       object
initial_referrer        object
preferential_client    float64
dtype: object
In [23]:
len(users)
Out[23]:
1000

Overall research design¶

  • Dependent variables: order_euros (continuous), purchase (binary)
  • Independent variables (or features): session-related information and if the user is a preferential client

Partially using the user information:

  • Every session has information (referral, paid_campaign, details of the user_agent)
  • Not everybody is logged in, but preferential clients do log in, so whether a user is one can be meaningful information

Alternatives:

  • Use only sessions (ignore user information)
  • Do the data analysis only on logged-in users (remove all sessions that come from anonymous users)

RQs and hypotheses¶

RQ1: To what extent does coming from a paid campaign influence the likelihood of a purchase?

H1: Users coming from a paid campaign are more likely to purchase than users not coming from a paid campaign

RQ2: To what extent does coming from a paid campaign influence the amount spent (order_euros)?

H2: Users coming from a paid campaign will spend more than users not coming from a paid campaign

Checklist for merging¶

  1. Merge orders with sessions by session_id (be careful: orders only contains sessions in which a purchase was made)
  2. Merge sessions with users (to retrieve only whether a user is a preferential client, or not)
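The checklist can be sketched on toy stand-ins for the three files (same key columns, invented values); the standard indicator=True option of merge is one way to verify how many sessions actually matched an order:

```python
import pandas as pd

# Toy stand-ins for the real files (same key columns, made-up values).
orders = pd.DataFrame({'session_id': [1, 3], 'purchase': [1.0, 1.0]})
sessions = pd.DataFrame({'session_id': [1, 2, 3], 'user_id': ['u1', None, 'u2']})
users = pd.DataFrame({'user_id': ['u1', 'u2'], 'preferential_client': [1.0, None]})

# Step 1: left-merge so non-purchasing sessions are kept (as NaN rows).
# indicator=True adds a _merge column that shows which rows found a match.
data = sessions.merge(orders, on='session_id', how='left', indicator=True)
print(data['_merge'].value_counts())

# Step 2: left-merge the user flag onto the combined frame.
data = data.drop(columns='_merge').merge(users, on='user_id', how='left')
print(len(data))  # still one row per session: 3
```

A left merge keeps every session, so the row count should not change in either step; if it does, the key likely has duplicates.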
In [24]:
orders.dtypes
Out[24]:
session_id       int64
order_euros    float64
purchase       float64
dtype: object
In [25]:
sessions.dtypes
Out[25]:
session_id                   object
session_timestamp    datetime64[ns]
user_agent                   object
referral                     object
paid_campaign               float64
user_id                      object
dtype: object
In [26]:
# Convert session_id to numeric so it matches orders['session_id'] for the merge
sessions['session_id'] = pd.to_numeric(sessions['session_id'])
In [27]:
data = sessions.merge(orders, on='session_id', how='left')
In [28]:
len(sessions), len(orders), len(data)
Out[28]:
(50000, 11339, 50000)
In [30]:
data.head()
Out[30]:
session_id session_timestamp user_agent referral paid_campaign user_id order_euros purchase
0 5555694754 2021-09-20 06:59:09 Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2... google.com 1.0 NaN 150.0 1.0
1 5555694755 2021-09-23 19:07:17 Mozilla/5.0 (Windows; U; Windows NT 6.0) Apple... instagram.com 4.0 NaN NaN NaN
2 5555694756 2021-09-25 14:12:23 Mozilla/5.0 (Android 2.3.6; Mobile; rv:7.0) Ge... instagram.com NaN NaN NaN NaN
3 5555694757 2021-09-20 11:12:36 Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ... google.com NaN NaN NaN NaN
4 5555694758 2021-09-24 11:12:22 Opera/9.52.(X11; Linux i686; bho-IN) Presto/2.... google.com 2.0 NaN NaN NaN
In [31]:
data.dtypes
Out[31]:
session_id                    int64
session_timestamp    datetime64[ns]
user_agent                   object
referral                     object
paid_campaign               float64
user_id                      object
order_euros                 float64
purchase                    float64
dtype: object
In [32]:
users.dtypes
Out[32]:
id                      object
reg_name                object
age                      int64
registration_date       object
initial_referrer        object
preferential_client    float64
dtype: object
In [33]:
users = users.rename(columns={'id': 'user_id'})
In [34]:
users.dtypes
Out[34]:
user_id                 object
reg_name                object
age                      int64
registration_date       object
initial_referrer        object
preferential_client    float64
dtype: object
In [35]:
data_users = data.merge(users, on='user_id', how='left')
In [36]:
len(data_users), len(data), len(users)
Out[36]:
(50000, 50000, 1000)
In [37]:
data_users.head()
Out[37]:
session_id session_timestamp user_agent referral paid_campaign user_id order_euros purchase reg_name age registration_date initial_referrer preferential_client
0 5555694754 2021-09-20 06:59:09 Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2... google.com 1.0 NaN 150.0 1.0 NaN NaN NaN NaN NaN
1 5555694755 2021-09-23 19:07:17 Mozilla/5.0 (Windows; U; Windows NT 6.0) Apple... instagram.com 4.0 NaN NaN NaN NaN NaN NaN NaN NaN
2 5555694756 2021-09-25 14:12:23 Mozilla/5.0 (Android 2.3.6; Mobile; rv:7.0) Ge... instagram.com NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 5555694757 2021-09-20 11:12:36 Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ... google.com NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5555694758 2021-09-24 11:12:22 Opera/9.52.(X11; Linux i686; bho-IN) Presto/2.... google.com 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
In [39]:
data_users.isna().sum()
Out[39]:
session_id                 0
session_timestamp          0
user_agent                 0
referral                   0
paid_campaign          21569
user_id                34877
order_euros            38661
purchase               38661
reg_name               34877
age                    34877
registration_date      34877
initial_referrer       34877
preferential_client    47488
dtype: int64

Key variables¶

Dependent variables:

  • order_euros
  • purchase

Independent variable(s):

  • paid_campaign (main IV)
  • preferential_client
  • referral
  • user_agent (operating system)

Data minimization¶

In [40]:
df = data_users[['session_id', 'order_euros', 'purchase', 'paid_campaign',
                'preferential_client', 'referral', 'user_agent']]

Data cleaning & transformation¶

In [41]:
df.head()
Out[41]:
session_id order_euros purchase paid_campaign preferential_client referral user_agent
0 5555694754 150.0 1.0 1.0 NaN google.com Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2...
1 5555694755 NaN NaN 4.0 NaN instagram.com Mozilla/5.0 (Windows; U; Windows NT 6.0) Apple...
2 5555694756 NaN NaN NaN NaN instagram.com Mozilla/5.0 (Android 2.3.6; Mobile; rv:7.0) Ge...
3 5555694757 NaN NaN NaN NaN google.com Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ...
4 5555694758 NaN NaN 2.0 NaN google.com Opera/9.52.(X11; Linux i686; bho-IN) Presto/2....
In [42]:
df.isna().sum()
Out[42]:
session_id                 0
order_euros            38661
purchase               38661
paid_campaign          21569
preferential_client    47488
referral                   0
user_agent                 0
dtype: int64
In [43]:
df['order_euros'].describe()
Out[43]:
count    11339.000000
mean       178.913661
std         54.820250
min        110.500000
25%        150.000000
50%        150.000000
75%        208.000000
max        487.000000
Name: order_euros, dtype: float64
In [44]:
df['order_euros'] = df['order_euros'].fillna(0)
/var/folders/vf/g8sc1dyd5tb46q752y6jl2k40000gn/T/ipykernel_9043/1938906212.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['order_euros'] = df['order_euros'].fillna(0)
In [45]:
df['order_euros'].describe()
Out[45]:
count    50000.000000
mean        40.574040
std         79.338489
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max        487.000000
Name: order_euros, dtype: float64
In [46]:
df['purchase'].value_counts()
Out[46]:
1.0    11339
Name: purchase, dtype: int64
In [47]:
df['purchase'] = df['purchase'].fillna(0)
In [48]:
df['purchase'].value_counts()
Out[48]:
0.0    38661
1.0    11339
Name: purchase, dtype: int64
In [49]:
df['paid_campaign'].value_counts()
Out[49]:
1.0    7179
3.0    7157
4.0    7121
2.0    6974
Name: paid_campaign, dtype: int64
In [50]:
df['paid_campaign'] = df['paid_campaign'].fillna(0)
In [51]:
df['paid_campaign'].value_counts()
Out[51]:
0.0    21569
1.0     7179
3.0     7157
4.0     7121
2.0     6974
Name: paid_campaign, dtype: int64
In [52]:
df['preferential_client'].value_counts()
Out[52]:
1.0    2512
Name: preferential_client, dtype: int64
In [53]:
df['preferential_client'] = df['preferential_client'].fillna(0)
In [54]:
df['preferential_client'].value_counts()
Out[54]:
0.0    47488
1.0     2512
Name: preferential_client, dtype: int64
In [55]:
df.head()
Out[55]:
session_id order_euros purchase paid_campaign preferential_client referral user_agent
0 5555694754 150.0 1.0 1.0 0.0 google.com Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2...
1 5555694755 0.0 0.0 4.0 0.0 instagram.com Mozilla/5.0 (Windows; U; Windows NT 6.0) Apple...
2 5555694756 0.0 0.0 0.0 0.0 instagram.com Mozilla/5.0 (Android 2.3.6; Mobile; rv:7.0) Ge...
3 5555694757 0.0 0.0 0.0 0.0 google.com Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ...
4 5555694758 0.0 0.0 2.0 0.0 google.com Opera/9.52.(X11; Linux i686; bho-IN) Presto/2....
In [56]:
df['referral'].value_counts()
Out[56]:
google.com               11168
instagram.com            11136
facebook.com             10938
smith.com                  135
johnson.com                104
                         ...  
pitts.org                    1
cunningham-wilson.com        1
williams-holloway.com        1
brown-stone.info             1
hodge.org                    1
Name: referral, Length: 8630, dtype: int64
In [57]:
def parse_variables(row):
    # Handling paid campaign
    row['paid_campaign_binary'] = 0
    if row['paid_campaign'] > 0:
        row['paid_campaign_binary'] = 1
        
    row['paid_campaign_1'] = 0
    row['paid_campaign_2'] = 0
    row['paid_campaign_3'] = 0
    row['paid_campaign_4'] = 0
    
    if row['paid_campaign'] == 1:
        row['paid_campaign_1'] = 1
        
    if row['paid_campaign'] == 2:
        row['paid_campaign_2'] = 1

    if row['paid_campaign'] == 3:
        row['paid_campaign_3'] = 1
        
    if row['paid_campaign'] == 4:
        row['paid_campaign_4'] = 1
        
    # Handling referral
    
    if 'google' in str(row['referral']).lower():
        row['referral_google'] = 1
    else:
        row['referral_google'] = 0

    if 'instagram' in str(row['referral']).lower():
        row['referral_instagram'] = 1
    else:
        row['referral_instagram'] = 0
 

    if 'facebook' in str(row['referral']).lower():
        row['referral_facebook'] = 1
    else:
        row['referral_facebook'] = 0
        
    # Handling user agent
    if 'mac' in str(row['user_agent']).lower():
        row['user_agent_mac'] = 1
    else:
        row['user_agent_mac'] = 0    
    
    if 'windows' in str(row['user_agent']).lower():
        row['user_agent_windows'] = 1
    else:
        row['user_agent_windows'] = 0  
        
    
    return row
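As an aside, the same derived columns can be built without a row-wise apply, using vectorized comparisons and string methods; this is typically much faster on 50,000 rows and sidesteps the SettingWithCopyWarning. A sketch on toy data (column names taken from the function above, values invented):

```python
import pandas as pd

# Toy frame with the same columns parse_variables reads (made-up values).
toy = pd.DataFrame({
    'paid_campaign': [1.0, 0.0, 3.0],
    'referral': ['google.com', 'facebook.com', 'instagram.com'],
    'user_agent': ['Mozilla/5.0 (Windows NT 6.2)',
                   'Opera/9.52 (Macintosh)',
                   'Mozilla/5.0 (Android)'],
})

# Binary flag for coming from any paid campaign
toy['paid_campaign_binary'] = (toy['paid_campaign'] > 0).astype(int)

# One dummy column per campaign, mirroring paid_campaign_1..4
for c in (1, 2, 3, 4):
    toy[f'paid_campaign_{c}'] = (toy['paid_campaign'] == c).astype(int)

# Substring flags via vectorized string methods
for site in ('google', 'instagram', 'facebook'):
    toy[f'referral_{site}'] = toy['referral'].str.lower().str.contains(site).astype(int)
for os_name in ('mac', 'windows'):
    toy[f'user_agent_{os_name}'] = toy['user_agent'].str.lower().str.contains(os_name).astype(int)

print(toy[['paid_campaign_binary', 'referral_google', 'user_agent_windows']].values.tolist())
# [[1, 1, 1], [0, 0, 0], [1, 0, 0]]
```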
In [58]:
df.iloc[0]
Out[58]:
session_id                                                    5555694754
order_euros                                                        150.0
purchase                                                             1.0
paid_campaign                                                        1.0
preferential_client                                                  0.0
referral                                                      google.com
user_agent             Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2...
Name: 0, dtype: object
In [59]:
parse_variables(df.iloc[0])
/var/folders/vf/g8sc1dyd5tb46q752y6jl2k40000gn/T/ipykernel_9043/1974942019.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  row['paid_campaign_binary'] = 0
/Users/t.b.araujouva.nl/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py:692: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value, self.name)
/var/folders/vf/g8sc1dyd5tb46q752y6jl2k40000gn/T/ipykernel_9043/1974942019.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  row['paid_campaign_binary'] = 1
/var/folders/vf/g8sc1dyd5tb46q752y6jl2k40000gn/T/ipykernel_9043/1974942019.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  row['paid_campaign_1'] = 0
/var/folders/vf/g8sc1dyd5tb46q752y6jl2k40000gn/T/ipykernel_9043/1974942019.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  row['paid_campaign_2'] = 0
/var/folders/vf/g8sc1dyd5tb46q752y6jl2k40000gn/T/ipykernel_9043/1974942019.py:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  row['paid_campaign_3'] = 0
/var/folders/vf/g8sc1dyd5tb46q752y6jl2k40000gn/T/ipykernel_9043/1974942019.py:10: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  row['paid_campaign_4'] = 0
/var/folders/vf/g8sc1dyd5tb46q752y6jl2k40000gn/T/ipykernel_9043/1974942019.py:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  row['paid_campaign_1'] = 1
/var/folders/vf/g8sc1dyd5tb46q752y6jl2k40000gn/T/ipykernel_9043/1974942019.py:27: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  row['referral_google'] = 1
/var/folders/vf/g8sc1dyd5tb46q752y6jl2k40000gn/T/ipykernel_9043/1974942019.py:34: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  row['referral_instagram'] = 0
/var/folders/vf/g8sc1dyd5tb46q752y6jl2k40000gn/T/ipykernel_9043/1974942019.py:40: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  row['referral_facebook'] = 0
/var/folders/vf/g8sc1dyd5tb46q752y6jl2k40000gn/T/ipykernel_9043/1974942019.py:46: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  row['user_agent_mac'] = 0
/var/folders/vf/g8sc1dyd5tb46q752y6jl2k40000gn/T/ipykernel_9043/1974942019.py:49: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  row['user_agent_windows'] = 1
Out[59]:
session_id                                                     5555694754
order_euros                                                         150.0
purchase                                                              1.0
paid_campaign                                                         1.0
preferential_client                                                   0.0
referral                                                       google.com
user_agent              Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2...
paid_campaign_binary                                                    1
paid_campaign_1                                                         1
paid_campaign_2                                                         0
paid_campaign_3                                                         0
paid_campaign_4                                                         0
referral_google                                                         1
referral_instagram                                                      0
referral_facebook                                                       0
user_agent_mac                                                          0
user_agent_windows                                                      1
Name: 0, dtype: object
In [60]:
parse_variables(df.iloc[100])
Out[60]:
session_id                                                     5555694854
order_euros                                                           0.0
purchase                                                              0.0
paid_campaign                                                         1.0
preferential_client                                                   0.0
referral                                                     facebook.com
user_agent              Opera/9.60.(Windows 95; lij-IT) Presto/2.9.188...
paid_campaign_binary                                                    1
paid_campaign_1                                                         1
paid_campaign_2                                                         0
paid_campaign_3                                                         0
paid_campaign_4                                                         0
referral_google                                                         0
referral_instagram                                                      0
referral_facebook                                                       1
user_agent_mac                                                          0
user_agent_windows                                                      1
Name: 100, dtype: object
In [61]:
df = df.apply(parse_variables, axis=1)
In [62]:
df.groupby(['paid_campaign_binary', 'paid_campaign'])['session_id'].count()
Out[62]:
paid_campaign_binary  paid_campaign
0                     0.0              21569
1                     1.0               7179
                      2.0               6974
                      3.0               7157
                      4.0               7121
Name: session_id, dtype: int64
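The warnings above come from assigning into a row slice inside `apply`. As an aside, the same dummy columns can be built with vectorized comparisons, which avoids `SettingWithCopyWarning` entirely. A minimal sketch on a hypothetical mini-frame (the column names follow the challenge dataset; the data here is made up):

```python
import pandas as pd

# Hypothetical mini-frame with the same columns as the challenge dataset
mini = pd.DataFrame({
    "paid_campaign": [0.0, 1.0, 2.0, 0.0],
    "referral": ["google.com", "facebook.com", "other.com", "instagram.com"],
})

# One dummy per campaign, plus the binary indicator, without row-wise assignment
for i in range(1, 5):
    mini[f"paid_campaign_{i}"] = (mini["paid_campaign"] == i).astype(int)
mini["paid_campaign_binary"] = (mini["paid_campaign"] > 0).astype(int)

# Dummies for the three named referrers (everything else is the reference category)
for site in ["google", "instagram", "facebook"]:
    mini[f"referral_{site}"] = (mini["referral"] == f"{site}.com").astype(int)

print(mini[["paid_campaign_binary", "paid_campaign_1", "referral_google"]].sum().tolist())
# → [2, 1, 1]
```

The groupby cross-checks above and below serve the same purpose either way: verifying that each dummy is 1 exactly when the source variable has the corresponding value.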
In [63]:
df.groupby(['paid_campaign_1', 'paid_campaign'])['session_id'].count()
Out[63]:
paid_campaign_1  paid_campaign
0                0.0              21569
                 2.0               6974
                 3.0               7157
                 4.0               7121
1                1.0               7179
Name: session_id, dtype: int64
In [64]:
df.groupby(['paid_campaign_2', 'paid_campaign'])['session_id'].count()
Out[64]:
paid_campaign_2  paid_campaign
0                0.0              21569
                 1.0               7179
                 3.0               7157
                 4.0               7121
1                2.0               6974
Name: session_id, dtype: int64
In [65]:
df.groupby(['paid_campaign_3', 'paid_campaign'])['session_id'].count()
Out[65]:
paid_campaign_3  paid_campaign
0                0.0              21569
                 1.0               7179
                 2.0               6974
                 4.0               7121
1                3.0               7157
Name: session_id, dtype: int64
In [67]:
df.groupby(['paid_campaign_4', 'paid_campaign'])['session_id'].count()
Out[67]:
paid_campaign_4  paid_campaign
0                0.0              21569
                 1.0               7179
                 2.0               6974
                 3.0               7157
1                4.0               7121
Name: session_id, dtype: int64
In [70]:
df[df['referral_google']==1]['referral'].value_counts()
Out[70]:
google.com    11168
Name: referral, dtype: int64
In [71]:
df[df['referral_google']==0]['referral'].value_counts()
Out[71]:
instagram.com            11136
facebook.com             10938
smith.com                  135
johnson.com                104
williams.com               103
                         ...  
gonzales-brown.com           1
miller-graves.com            1
stuart-walker.info           1
keller-washington.org        1
hodge.org                    1
Name: referral, Length: 8629, dtype: int64
In [72]:
df[df['referral_instagram']==1]['referral'].value_counts()
Out[72]:
instagram.com    11136
Name: referral, dtype: int64
In [73]:
df[df['referral_instagram']==0]['referral'].value_counts()
Out[73]:
google.com               11168
facebook.com             10938
smith.com                  135
johnson.com                104
williams.com               103
                         ...  
gonzales-brown.com           1
miller-graves.com            1
stuart-walker.info           1
keller-washington.org        1
hodge.org                    1
Name: referral, Length: 8629, dtype: int64
In [74]:
df[df['referral_facebook']==1]['referral'].value_counts()
Out[74]:
facebook.com    10938
Name: referral, dtype: int64
In [75]:
df[df['referral_facebook']==0]['referral'].value_counts()
Out[75]:
google.com               11168
instagram.com            11136
smith.com                  135
johnson.com                104
williams.com               103
                         ...  
stuart-walker.info           1
keller-washington.org        1
bailey-blair.info            1
cameron.info                 1
hodge.org                    1
Name: referral, Length: 8629, dtype: int64
In [76]:
df[df['user_agent_mac']==1]['user_agent'].value_counts()
Out[76]:
Mozilla/5.0 (iPod; U; CPU iPhone OS 3_1 like Mac OS X; ru-UA) AppleWebKit/533.7.2 (KHTML, like Gecko) Version/3.0.5 Mobile/8B112 Safari/6533.7.2      1
Mozilla/5.0 (Macintosh; Intel Mac OS X 10 10_9 rv:6.0; ru-UA) AppleWebKit/535.36.1 (KHTML, like Gecko) Version/5.0.5 Safari/535.36.1                  1
Mozilla/5.0 (Macintosh; PPC Mac OS X 10 5_9 rv:6.0; ckb-IQ) AppleWebKit/531.1.7 (KHTML, like Gecko) Version/5.0 Safari/531.1.7                        1
Mozilla/5.0 (iPad; CPU iPad OS 10_3_3 like Mac OS X) AppleWebKit/533.2 (KHTML, like Gecko) FxiOS/17.9s6334.0 Mobile/71W933 Safari/533.2               1
Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10 10_8) AppleWebKit/534.2 (KHTML, like Gecko) Chrome/44.0.837.0 Safari/534.2                                 1
                                                                                                                                                     ..
Mozilla/5.0 (Macintosh; Intel Mac OS X 10 8_1 rv:3.0; gez-ER) AppleWebKit/534.32.7 (KHTML, like Gecko) Version/5.0 Safari/534.32.7                    1
Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; th-TH) AppleWebKit/532.5.4 (KHTML, like Gecko) Version/3.0.5 Mobile/8B117 Safari/6532.5.4      1
Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_5 like Mac OS X) AppleWebKit/534.2 (KHTML, like Gecko) FxiOS/11.5t8039.0 Mobile/77N124 Safari/534.2            1
Mozilla/5.0 (iPad; CPU iPad OS 6_1_6 like Mac OS X) AppleWebKit/531.2 (KHTML, like Gecko) CriOS/33.0.837.0 Mobile/08F889 Safari/531.2                 1
Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3 like Mac OS X; ce-RU) AppleWebKit/533.50.7 (KHTML, like Gecko) Version/4.0.5 Mobile/8B113 Safari/6533.50.7    1
Name: user_agent, Length: 14778, dtype: int64
In [77]:
df[df['user_agent_mac']==0]['user_agent'].value_counts()
Out[77]:
Mozilla/5.0 (compatible; MSIE 8.0; Windows 98; Trident/5.1)                                          40
Mozilla/5.0 (compatible; MSIE 5.0; Windows NT 5.1; Trident/5.0)                                      38
Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.2; Trident/4.0)                                      38
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 5.1; Trident/5.1)                                      37
Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.01; Trident/5.1)                                     36
                                                                                                     ..
Opera/8.41.(Windows CE; lij-IT) Presto/2.9.173 Version/11.00                                          1
Opera/8.28.(X11; Linux x86_64; kw-GB) Presto/2.9.164 Version/10.00                                    1
Mozilla/5.0 (Windows NT 6.0) AppleWebKit/532.2 (KHTML, like Gecko) Chrome/55.0.848.0 Safari/532.2     1
Opera/9.99.(Windows NT 10.0; ia-FR) Presto/2.9.165 Version/10.00                                      1
Opera/8.39.(Windows NT 6.1; sl-SI) Presto/2.9.166 Version/10.00                                       1
Name: user_agent, Length: 25107, dtype: int64
In [78]:
df[df['user_agent_windows']==1]['user_agent'].value_counts()
Out[78]:
Mozilla/5.0 (compatible; MSIE 8.0; Windows 98; Trident/5.1)                                                        40
Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.2; Trident/4.0)                                                    38
Mozilla/5.0 (compatible; MSIE 5.0; Windows NT 5.1; Trident/5.0)                                                    38
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 5.1; Trident/5.1)                                                    37
Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 5.01; Trident/4.1)                                                   36
                                                                                                                   ..
Opera/8.85.(Windows NT 6.0; nb-NO) Presto/2.9.173 Version/12.00                                                     1
Mozilla/5.0 (Windows; U; Windows NT 4.0) AppleWebKit/535.42.5 (KHTML, like Gecko) Version/5.1 Safari/535.42.5       1
Opera/9.30.(Windows NT 5.2; eo-US) Presto/2.9.184 Version/12.00                                                     1
Mozilla/5.0 (Windows; U; Windows NT 6.2) AppleWebKit/535.30.6 (KHTML, like Gecko) Version/5.0.2 Safari/535.30.6     1
Opera/8.39.(Windows NT 6.1; sl-SI) Presto/2.9.166 Version/10.00                                                     1
Name: user_agent, Length: 12626, dtype: int64
In [79]:
df[df['user_agent_windows']==0]['user_agent'].value_counts()
Out[79]:
Mozilla/5.0 (Android 2.3.7; Mobile; rv:61.0) Gecko/61.0 Firefox/61.0                                                                                  5
Mozilla/5.0 (Android 7.1; Mobile; rv:60.0) Gecko/60.0 Firefox/60.0                                                                                    5
Mozilla/5.0 (Android 4.1.2; Mobile; rv:55.0) Gecko/55.0 Firefox/55.0                                                                                  4
Mozilla/5.0 (Android 2.2.2; Mobile; rv:22.0) Gecko/22.0 Firefox/22.0                                                                                  4
Mozilla/5.0 (Android 2.3.6; Mobile; rv:19.0) Gecko/19.0 Firefox/19.0                                                                                  4
                                                                                                                                                     ..
Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_6 like Mac OS X) AppleWebKit/533.0 (KHTML, like Gecko) FxiOS/12.0o1808.0 Mobile/63R671 Safari/533.0            1
Mozilla/5.0 (iPhone; CPU iPhone OS 5_1_1 like Mac OS X) AppleWebKit/535.0 (KHTML, like Gecko) FxiOS/10.3t5107.0 Mobile/14Q726 Safari/535.0            1
Mozilla/5.0 (iPod; U; CPU iPhone OS 3_3 like Mac OS X; eo-US) AppleWebKit/533.37.2 (KHTML, like Gecko) Version/4.0.5 Mobile/8B115 Safari/6533.37.2    1
Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/531.2 (KHTML, like Gecko) FxiOS/14.8b4386.0 Mobile/07Z080 Safari/531.2            1
Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10 9_1 rv:2.0; cy-GB) AppleWebKit/535.37.4 (KHTML, like Gecko) Version/5.0.4 Safari/535.37.4                  1
Name: user_agent, Length: 27259, dtype: int64
In [80]:
df.head()
Out[80]:
session_id order_euros purchase paid_campaign preferential_client referral user_agent paid_campaign_binary paid_campaign_1 paid_campaign_2 paid_campaign_3 paid_campaign_4 referral_google referral_instagram referral_facebook user_agent_mac user_agent_windows
0 5555694754 150.0 1.0 1.0 0.0 google.com Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2... 1 1 0 0 0 1 0 0 0 1
1 5555694755 0.0 0.0 4.0 0.0 instagram.com Mozilla/5.0 (Windows; U; Windows NT 6.0) Apple... 1 0 0 0 1 0 1 0 0 1
2 5555694756 0.0 0.0 0.0 0.0 instagram.com Mozilla/5.0 (Android 2.3.6; Mobile; rv:7.0) Ge... 0 0 0 0 0 0 1 0 0 0
3 5555694757 0.0 0.0 0.0 0.0 google.com Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ... 0 0 0 0 0 1 0 0 0 1
4 5555694758 0.0 0.0 2.0 0.0 google.com Opera/9.52.(X11; Linux i686; bho-IN) Presto/2.... 1 0 1 0 0 1 0 0 0 0

Final key variables¶

Dependent variables:

  • order_euros
  • purchase

Independent variable(s):

  • paid_campaign (main IV) --> either as paid_campaign_binary OR per type of paid_campaign (e.g., paid_campaign_1)
  • preferential_client
  • referral --> referral_google, referral_facebook, referral_instagram (reference: other types of referral)
  • user_agent (operating system) --> user_agent_mac, user_agent_windows (reference: other types of user_agent)

Note: Descriptives & visualizations are not covered in the solution videos this week; refer to DA3 and DA4.¶

Challenge 2¶

Programming challenge¶

Now that you have your dataset created (challenge 1), we would like you to focus on the variable purchase (binary variable) as the DV. You need to:

  1. Create one hypothesis for your model (i.e., how at least one IV influences the DV)
  2. Create one statistical model with statsmodels that (a) tests your hypothesis and (b) checks the influence of at least two other relevant control variables
  3. Use machine learning (scikit-learn) to create the same model (as in item 2), and use this model to run predictions (e.g., what is the likelihood of a purchase depending on different combinations of values for the IV and controls?)

Optional: Use LIME to explain the predictions created by the model, contrasting the importance of each of the features (IV or controls) in the model.

Interpretation¶

Write a technical and a business interpretation for the models and predictions you created.

  • The technical interpretation should explain, step-by-step, what you are doing with the Python commands that you are using. Before each line of code, add a line in Markdown that outlines what is being done below, and why. Note: For this challenge, this technical interpretation should be combined with the programming steps above.
  • Create a business summary of the results:
    • Provide details of whether your hypothesis was supported or rejected, and motivate this answer by referring to specific aspects in the models or predictions you made
    • Provide one recommendation to the business stakeholder (the marketing manager discussed in challenge 1)

Use Markdown formatting to make the summary clear and visually appealing, and constantly refer to the output of the code to substantiate your claims. If needed, you can combine the code with the summary (e.g., start the summary in Markdown, run the relevant code in the next cell, and interpret the output using Markdown in the next cell).

Reflection¶

Hind (2019) discusses four main groups interested in explainable AI. Select the three groups most relevant for this challenge and explain, in the context of this challenge, why each would be interested in understanding the model being used, and what they would want to know about the model in order to trust its predictions.

In [81]:
df.columns
Out[81]:
Index(['session_id', 'order_euros', 'purchase', 'paid_campaign',
       'preferential_client', 'referral', 'user_agent', 'paid_campaign_binary',
       'paid_campaign_1', 'paid_campaign_2', 'paid_campaign_3',
       'paid_campaign_4', 'referral_google', 'referral_instagram',
       'referral_facebook', 'user_agent_mac', 'user_agent_windows'],
      dtype='object')
In [82]:
df.dtypes
Out[82]:
session_id                int64
order_euros             float64
purchase                float64
paid_campaign           float64
preferential_client     float64
referral                 object
user_agent               object
paid_campaign_binary      int64
paid_campaign_1           int64
paid_campaign_2           int64
paid_campaign_3           int64
paid_campaign_4           int64
referral_google           int64
referral_instagram        int64
referral_facebook         int64
user_agent_mac            int64
user_agent_windows        int64
dtype: object

Model¶

RQ1: To what extent does coming from a paid campaign influence the likelihood of a purchase?

H1: Users coming from a paid campaign are more likely to purchase than users not coming from a paid campaign

  • Dependent variable: purchase
  • Independent variable (main feature): paid campaign (binary)
  • Control variables (other features):
    • referral_google
    • referral_facebook
    • referral_instagram
    • user_agent_mac
    • user_agent_windows
In [83]:
df.groupby(['paid_campaign','paid_campaign_1'])['session_id'].count()
Out[83]:
paid_campaign  paid_campaign_1
0.0            0                  21569
1.0            1                   7179
2.0            0                   6974
3.0            0                   7157
4.0            0                   7121
Name: session_id, dtype: int64
In [84]:
df.groupby(['paid_campaign','paid_campaign_binary'])['session_id'].count()
Out[84]:
paid_campaign  paid_campaign_binary
0.0            0                       21569
1.0            1                        7179
2.0            1                        6974
3.0            1                        7157
4.0            1                        7121
Name: session_id, dtype: int64

Nature of the dependent variable: binary (0 or 1), so we fit a logistic regression.

In [85]:
df['purchase'].describe()
Out[85]:
count    50000.000000
mean         0.226780
std          0.418753
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: purchase, dtype: float64
In [86]:
df['purchase'].value_counts()
Out[86]:
0.0    38661
1.0    11339
Name: purchase, dtype: int64
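As a quick sanity check, the class counts are consistent with the mean reported by `describe()`: for a 0/1 variable, the mean is simply the share of 1s.

```python
# Counts copied from the value_counts() output above
n_purchase, n_total = 11339, 50000

rate = n_purchase / n_total
print(rate)  # → 0.22678, matching the mean of `purchase` in describe()
```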
In [87]:
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
import numpy as np
import lime
from lime import lime_tabular

Statistical testing¶

In [89]:
purchase_stat = sm.Logit(df['purchase'], 
                  sm.add_constant(df[['paid_campaign_binary', 'referral_google', 
                                     'referral_instagram', 'referral_facebook',
                                     'user_agent_mac', 'user_agent_windows']]))
In [90]:
purchase_results = purchase_stat.fit()
Optimization terminated successfully.
         Current function value: 0.389695
         Iterations 7
In [91]:
purchase_results.summary()
Out[91]:
Logit Regression Results
Dep. Variable: purchase No. Observations: 50000
Model: Logit Df Residuals: 49993
Method: MLE Df Model: 6
Date: Sun, 03 Oct 2021 Pseudo R-squ.: 0.2721
Time: 16:21:49 Log-Likelihood: -19485.
converged: True LL-Null: -26768.
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
const -2.7171 0.038 -70.706 0.000 -2.792 -2.642
paid_campaign_binary 0.8210 0.027 30.635 0.000 0.768 0.874
referral_google 2.2565 0.034 67.334 0.000 2.191 2.322
referral_instagram -1.1772 0.049 -24.013 0.000 -1.273 -1.081
referral_facebook 0.6662 0.035 19.277 0.000 0.598 0.734
user_agent_mac 1.4991 0.033 45.874 0.000 1.435 1.563
user_agent_windows -0.7520 0.034 -21.952 0.000 -0.819 -0.685
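The logit coefficients above are on the log-odds scale; exponentiating a coefficient gives an odds ratio (on the fitted model, `np.exp(purchase_results.params)` does this in one step). A small sketch using three coefficients copied from the summary table:

```python
import math

# Coefficients copied from the statsmodels summary above
coefs = {
    "paid_campaign_binary": 0.8210,
    "referral_google": 2.2565,
    "user_agent_mac": 1.4991,
}

# Odds ratio = exp(coef): e.g. coming from a paid campaign multiplies
# the odds of a purchase by roughly 2.27, holding the controls constant
odds_ratios = {name: round(math.exp(b), 2) for name, b in coefs.items()}
print(odds_ratios)
```

Odds ratios are usually easier to communicate to a business stakeholder than raw log-odds coefficients.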

Machine learning to make predictions¶

In [92]:
logit_clf = LogisticRegression(max_iter=1000, fit_intercept = True)
In [93]:
logit_clf.fit(df[['paid_campaign_binary', 'referral_google', 
                                     'referral_instagram', 'referral_facebook',
                                     'user_agent_mac', 'user_agent_windows']], df['purchase'])
Out[93]:
LogisticRegression(max_iter=1000)

Feature order expected by predict_proba:

  • paid_campaign_binary (0,1)
  • referral_google (0,1)
  • referral_instagram (0,1)
  • referral_facebook (0,1)
  • user_agent_mac (0,1)
  • user_agent_windows (0,1)
In [94]:
logit_clf.predict_proba([[0,0,0,0,0,0]])
Out[94]:
array([[0.93787015, 0.06212985]])
In [95]:
logit_clf.predict_proba([[1,0,0,0,0,0]])
Out[95]:
array([[0.86925987, 0.13074013]])
In [96]:
logit_clf.predict_proba([[0,1,0,0,0,0]])
Out[96]:
array([[0.61317359, 0.38682641]])
In [97]:
logit_clf.predict_proba([[0,0,1,0,0,0]])
Out[97]:
array([[0.9799683, 0.0200317]])
In [98]:
logit_clf.predict_proba([[0,0,0,0,0,1]])
Out[98]:
array([[0.96970812, 0.03029188]])
In [99]:
logit_clf.predict_proba([[1,1,0,0,0,0]])
Out[99]:
array([[0.41113328, 0.58886672]])
In [100]:
logit_clf.predict_proba([[1,1,0,0,1,0]])
Out[100]:
array([[0.13510515, 0.86489485]])
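The sklearn probabilities above can be reproduced approximately from the statsmodels coefficients, since both fit the same logistic model; small differences arise because `LogisticRegression` applies L2 regularisation by default. A sketch using the coefficients from the summary:

```python
import math

def sigmoid(z):
    # Logistic function: maps log-odds to a probability
    return 1 / (1 + math.exp(-z))

# Coefficients from the statsmodels summary above
const = -2.7171
b = {"paid": 0.8210, "google": 2.2565, "mac": 1.4991}

# Baseline user (all features 0): compare with predict_proba([[0,0,0,0,0,0]])
p_baseline = sigmoid(const)

# Paid campaign + Google referral + Mac: compare with predict_proba([[1,1,0,0,1,0]])
p_best = sigmoid(const + b["paid"] + b["google"] + b["mac"])

print(round(p_baseline, 3), round(p_best, 3))  # → 0.062 0.865
```

Both values agree with the sklearn predictions above to about two decimal places.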
In [101]:
data_lime_purchase = df[['paid_campaign_binary', 'referral_google', 
                                     'referral_instagram', 'referral_facebook',
                                     'user_agent_mac', 'user_agent_windows','purchase']]
In [102]:
class_names_purchase = data_lime_purchase.columns
X_data_lime_purchase = data_lime_purchase[['paid_campaign_binary', 'referral_google', 
                                     'referral_instagram', 'referral_facebook',
                                     'user_agent_mac', 'user_agent_windows']].to_numpy()
y_data_lime_purchase = data_lime_purchase['purchase'].to_numpy()
In [103]:
explainer = lime.lime_tabular.LimeTabularExplainer(
    X_data_lime_purchase, 
    feature_names=class_names_purchase, 
    verbose=True, 
    mode='classification')
In [105]:
print(X_data_lime_purchase[500])
exp = explainer.explain_instance(X_data_lime_purchase[500], logit_clf.predict_proba)
exp.show_in_notebook(show_table=True)
[1 0 1 0 1 0]
Intercept 0.43254461868119976
Prediction_local [0.23579199]
Right: 0.17179235340357854
In [109]:
exp = explainer.explain_instance(np.array([1,1,0,0,1,0]), logit_clf.predict_proba)
exp.show_in_notebook(show_table=True)
Intercept -0.07868311448183393
Prediction_local [0.75449158]
Right: 0.8648948455755271

Challenge 3¶

Programming challenge¶

Now that you have your dataset created (challenge 1), we would like you to focus on the variable order_euros as the DV. You need to:

  1. Create one hypothesis for your model (i.e., how at least one IV influences the DV)
  2. Create one statistical model with statsmodels that (a) tests your hypothesis and (b) checks the influence of at least two other relevant control variables
  3. Use machine learning (scikit-learn) to create the same model (as in item 2), and use this model to run predictions (e.g., how much is a user expected to spend depending on different combinations of values for the IV and controls?)

Optional: Use LIME to explain the predictions created by the model, contrasting the importance of each of the features (IV or controls) in the model.

Interpretation¶

Write a technical and a business interpretation for the models and predictions you created.

  • The technical interpretation should explain, step-by-step, what you are doing with the Python commands that you are using. Before each line of code, add a line in Markdown that outlines what is being done below, and why. Note: For this challenge, this technical interpretation should be combined with the programming steps above.
  • Create a business summary of the results:
    • Provide details of whether your hypothesis was supported or rejected, and motivate this answer by referring to specific aspects in the models or predictions you made
    • Provide one recommendation to the business stakeholder (the marketing manager discussed in challenge 1)

Use Markdown formatting to make the summary clear and visually appealing, and constantly refer to the output of the code to substantiate your claims. If needed, you can combine the code with the summary (e.g., start the summary in markdown, run the relevant code in the next cell, and interpret the output using MarkDown in the next cell).

Reflection¶

Hind (2019) briefly touches upon global and local approaches to model explanations. Looking at the models and predictions you created, review which aspects fit the idea of global explanations, and which fit the idea of local explanations. Select at least one example of each, and write one brief local explanation and one brief global explanation for the model. Use one of the three relevant groups identified in the reflection of challenge 2 as the stakeholder for the explanation (and indicate in the reflection who the stakeholder is).

RQ2: To what extent does coming from a paid campaign influence the amount spent (order_euros)?

H2: Users coming from a paid campaign will spend more than users not coming from a paid campaign

Type of dependent variable: order_euros is continuous, so we fit a linear (OLS) regression.

In [110]:
df['order_euros'].describe()
Out[110]:
count    50000.000000
mean        40.574040
std         79.338489
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max        487.000000
Name: order_euros, dtype: float64
In [111]:
df['order_euros'].value_counts()
Out[111]:
0.0      38661
150.0     3310
200.0      594
250.0      328
114.5       76
         ...  
329.5        1
374.5        1
323.5        1
384.5        1
460.5        1
Name: order_euros, Length: 398, dtype: int64
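Note that order_euros is zero for the 38,661 non-purchasers, so the unconditional mean of about 40.57 euros understates typical order size. Dividing the mean by the purchase rate gives the average order among purchasers (a hedged back-of-the-envelope check, using the figures from the outputs above):

```python
# Figures copied from describe() outputs above
mean_order = 40.574040     # mean of order_euros (all sessions)
purchase_rate = 0.226780   # mean of the binary `purchase` variable

# Average order value conditional on a purchase having happened
avg_order_among_purchasers = mean_order / purchase_rate
print(round(avg_order_among_purchasers, 2))  # roughly 179 euros
```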
In [112]:
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import numpy as np
import lime
from lime import lime_tabular
In [113]:
euros_stat = sm.OLS(df['order_euros'], 
                  sm.add_constant(df[['paid_campaign_binary', 'referral_google', 
                                     'referral_instagram', 'referral_facebook',
                                     'user_agent_mac', 'user_agent_windows']]))
In [114]:
euros_results = euros_stat.fit()
In [115]:
euros_results.summary()
Out[115]:
OLS Regression Results
Dep. Variable: order_euros R-squared: 0.278
Model: OLS Adj. R-squared: 0.278
Method: Least Squares F-statistic: 3202.
Date: Sun, 03 Oct 2021 Prob (F-statistic): 0.00
Time: 16:43:46 Log-Likelihood: -2.8150e+05
No. Observations: 50000 AIC: 5.630e+05
Df Residuals: 49993 BIC: 5.631e+05
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 6.4295 0.807 7.967 0.000 4.848 8.011
paid_campaign_binary 20.9800 0.609 34.453 0.000 19.786 22.174
referral_google 71.4277 0.824 86.708 0.000 69.813 73.042
referral_instagram -14.9127 0.824 -18.088 0.000 -16.529 -13.297
referral_facebook 18.6255 0.829 22.469 0.000 17.001 20.250
user_agent_mac 43.2900 0.812 53.312 0.000 41.698 44.882
user_agent_windows -16.3470 0.745 -21.930 0.000 -17.808 -14.886
Omnibus: 10538.016 Durbin-Watson: 2.004
Prob(Omnibus): 0.000 Jarque-Bera (JB): 20819.338
Skew: 1.277 Prob(JB): 0.00
Kurtosis: 4.863 Cond. No. 5.71


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [116]:
ols_clf = LinearRegression(fit_intercept = True)
In [117]:
ols_clf.fit(df[['paid_campaign_binary', 'referral_google', 
                                     'referral_instagram', 'referral_facebook',
                                     'user_agent_mac', 'user_agent_windows']], df['order_euros'])
Out[117]:
LinearRegression()
In [119]:
ols_clf.predict([[0,0,0,0,0,0]])
Out[119]:
array([6.42952853])
In [120]:
ols_clf.predict([[1,0,0,0,0,0]])
Out[120]:
array([27.40951576])
In [121]:
ols_clf.predict([[0,0,1,0,0,0]])
Out[121]:
array([-8.48317663])
In [124]:
ols_clf.predict([[1,1,0,0,1,0]])
Out[124]:
array([142.12722111])
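Because this is a linear model, the predictions above are just the sum of the intercept and the coefficients of the active features, which makes them easy to verify by hand. For the user from a paid campaign, referred by Google, on a Mac:

```python
# Coefficients copied from the OLS summary above
const = 6.4295
paid, google, mac = 20.9800, 71.4277, 43.2900

# Should match ols_clf.predict([[1,1,0,0,1,0]])
prediction = const + paid + google + mac
print(round(prediction, 2))  # → 142.13
```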
In [125]:
data_lime_euros = df[['paid_campaign_binary', 'referral_google', 
                                     'referral_instagram', 'referral_facebook',
                                     'user_agent_mac', 'user_agent_windows', 'order_euros']]
In [126]:
class_names_euros = data_lime_euros.columns
X_data_lime_euros = data_lime_euros[['paid_campaign_binary', 'referral_google', 
                                     'referral_instagram', 'referral_facebook',
                                     'user_agent_mac', 'user_agent_windows',]].to_numpy()
y_data_lime_euros = data_lime_euros['order_euros'].to_numpy()
In [127]:
explainer = lime.lime_tabular.LimeTabularExplainer(
    X_data_lime_euros, 
    feature_names=class_names_euros, 
    class_names=['order_euros'], 
    verbose=True, 
    mode='regression',
    discretize_continuous=True)
In [128]:
exp = explainer.explain_instance(np.array([0,0,0,0,0,0]), ols_clf.predict)
exp.show_in_notebook(show_table=True)
Intercept 129.3470897147102
Prediction_local [6.47515026]
Right: 6.429528527853279
In [129]:
exp = explainer.explain_instance(np.array([1,0,0,0,0,0]), ols_clf.predict)
exp.show_in_notebook(show_table=True)
Intercept 108.4000656122036
Prediction_local [27.43146288]
Right: 27.409515755088165

Important exception¶

As LIME is a framework still under active development, we cannot guarantee that it will work on all computers and configurations. If you get an error message when running LIME, hand in the challenge anyway (showing the error message), and we will accept it as complete.
